World University Rankings - Summer Camp 2023¶

Getting Started¶

Data Engineers play a critical role in developing products that are designed to help unlock the power of data. In this project we will focus on few keys areas within the space of Data Engineering and Business Intelligence

  1. Data Loading
  2. Data Exploration
  3. Data Analysis
  4. Data Visualization

Before we start, please follow the steps below.

Step 1: Prerequisites¶

Install Python on you computer - We will provide a handout with steps to download the python to your machine

Step 2: Python Environment¶

Please install the required modules needed to run the python program

Go to Terminal for MAC or Command Prompt for Windows and run below commands

  1. Pandas - Python Dataframe pip install pandas

  2. Plotly - Library to represent data graphically pip install plotly

Note: We will share instructions for installation separately

Lets Start Writing the Code....

Import the Python Packages¶

In [1]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
In [2]:
import plotly.express as px

Data Loading¶

For the purpose of this exercise, we will use World University Rankings dataset available on Kaggle: https://www.kaggle.com/datasets/r1chardson/the-world-university-rankings-2011-2023

Please download the files and store it in your computer.

We are looking at data from 2022. Please use other years' data as needed.

We will use pandas library to read a CSV format and leverage built-in functions to investigate the data.

Once the file are downloaded to your local machine. Provide a Path

Sample : /Users/username/Downloads/NCWiT Summer 2023/University rankings/2022_rankings.csv

In [3]:
# For this exercise, we have stored the files in the directory named "dataset"
# Please change this to the correct path based on where you have downloaded the dataset
filepath = '../dataset/2022_rankings.csv'
df = pd.read_csv(filepath)
print(df)
      rank_order      rank                                name scores_overall  \
0             10         1                University of Oxford           95.7   
1             20        =2  California Institute of Technology           95.0   
2             30        =2                  Harvard University           95.0   
3             40         4                 Stanford University           94.9   
4             50        =5             University of Cambridge           94.6   
...          ...       ...                                 ...            ...   
2107     1000446  Reporter                    Yaşar University            NaN   
2108     1000447  Reporter                 Yenepoya University            NaN   
2109     1000448  Reporter         Yogyakarta State University            NaN   
2110     1000449  Reporter             York St John University            NaN   
2111     1000450  Reporter                 Ziauddin University            NaN   

      scores_overall_rank  scores_teaching  scores_teaching_rank  \
0                      10             91.0                     5   
1                      20             93.6                     2   
2                      30             94.5                     1   
3                      40             92.3                     3   
4                      50             90.9                     6   
...                   ...              ...                   ...   
2107              1000446              NaN                     0   
2108              1000447              NaN                     0   
2109              1000448              NaN                     0   
2110              1000449              NaN                     0   
2111              1000450              NaN                     0   

      scores_research  scores_research_rank  scores_citations  ...  \
0                99.6                     1              98.0  ...   
1                96.9                     4              97.8  ...   
2                98.9                     3              99.2  ...   
3                96.8                     5              99.9  ...   
4                99.5                     2              96.2  ...   
...               ...                   ...               ...  ...   
2107              NaN                     0               NaN  ...   
2108              NaN                     0               NaN  ...   
2109              NaN                     0               NaN  ...   
2110              NaN                     0               NaN  ...   
2111              NaN                     0               NaN  ...   

      scores_international_outlook_rank        location  \
0                                    26  United Kingdom   
1                                   167   United States   
2                                   209   United States   
3                                   211   United States   
4                                    32  United Kingdom   
...                                 ...             ...   
2107                                  0          Turkey   
2108                                  0           India   
2109                                  0       Indonesia   
2110                                  0  United Kingdom   
2111                                  0        Pakistan   

      stats_number_students  stats_student_staff_ratio  \
0                    20,835                       10.7   
1                     2,233                        6.3   
2                    21,574                        9.5   
3                    16,319                        7.3   
4                    19,680                       11.1   
...                     ...                        ...   
2107                  6,847                       13.0   
2108                  3,104                        6.1   
2109                 24,988                       20.3   
2110                  6,030                       18.0   
2111                  4,018                        9.3   

      stats_pc_intl_students stats_female_male_ratio  \
0                        42%                 47 : 53   
1                        34%                 36 : 64   
2                        24%                 50 : 50   
3                        23%                 46 : 54   
4                        39%                 47 : 53   
...                      ...                     ...   
2107                      2%                 53 : 47   
2108                      0%                 67 : 33   
2109                      1%                 72 : 28   
2110                      8%                 66 : 34   
2111                      1%                 60 : 40   

                                         aliases  \
0                           University of Oxford   
1     California Institute of Technology caltech   
2                             Harvard University   
3                            Stanford University   
4                        University of Cambridge   
...                                          ...   
2107                            Yaşar University   
2108                         Yenepoya University   
2109                 Yogyakarta State University   
2110                     York St John University   
2111                         Ziauddin University   

                                       subjects_offered closed unaccredited  
0     Accounting & Finance,General Engineering,Commu...  False        False  
1     Languages, Literature & Linguistics,Economics ...  False        False  
2     Mathematics & Statistics,Civil Engineering,Lan...  False        False  
3     Physics & Astronomy,Computer Science,Politics ...  False        False  
4     Business & Management,General Engineering,Art,...  False        False  
...                                                 ...    ...          ...  
2107  Art, Performing Arts & Design,Mechanical & Aer...  False        False  
2108  Medicine & Dentistry,Biological Sciences,Other...  False        False  
2109  Civil Engineering,Physics & Astronomy,Educatio...  False        False  
2110  Biological Sciences,General Engineering,Geogra...  False        False  
2111  Business & Management,Biological Sciences,Medi...  False        False  

[2112 rows x 24 columns]

Data Exploration¶

Now that we have loaded the data, its time for us to explore the dataset and understand the key elements.

In [4]:
df.shape
Out[4]:
(2112 (0x840), 24 (0x18))

We see that the data has 2112 rows and 24 columns. Let us print the first 5 rows (along with the columns). This will give us an idea about what the data looks like.

In [5]:
df.head(5)
Out[5]:
rank_order rank name scores_overall scores_overall_rank scores_teaching scores_teaching_rank scores_research scores_research_rank scores_citations ... scores_international_outlook_rank location stats_number_students stats_student_staff_ratio stats_pc_intl_students stats_female_male_ratio aliases subjects_offered closed unaccredited
0 10 1 University of Oxford 95.7 10 91.0 5 99.6 1 98.0 ... 26 United Kingdom 20,835 10.7 42% 47 : 53 University of Oxford Accounting & Finance,General Engineering,Commu... False False
1 20 =2 California Institute of Technology 95.0 20 93.6 2 96.9 4 97.8 ... 167 United States 2,233 6.3 34% 36 : 64 California Institute of Technology caltech Languages, Literature & Linguistics,Economics ... False False
2 30 =2 Harvard University 95.0 30 94.5 1 98.9 3 99.2 ... 209 United States 21,574 9.5 24% 50 : 50 Harvard University Mathematics & Statistics,Civil Engineering,Lan... False False
3 40 4 Stanford University 94.9 40 92.3 3 96.8 5 99.9 ... 211 United States 16,319 7.3 23% 46 : 54 Stanford University Physics & Astronomy,Computer Science,Politics ... False False
4 50 =5 University of Cambridge 94.6 50 90.9 6 99.5 2 96.2 ... 32 United Kingdom 19,680 11.1 39% 47 : 53 University of Cambridge Business & Management,General Engineering,Art,... False False

5 rows × 24 columns

To get the number of rows and columns, use the shape function.

In [6]:
rows = df.shape[0]
cols = df.shape[1]
In [7]:
print(rows)
2112
In [8]:
print(cols)
24

Columns in the dataset¶

Let us also take a look at what columns exist in the dataset.

In [9]:
df.columns
Out[9]:
Index(['rank_order', 'rank', 'name', 'scores_overall', 'scores_overall_rank',
       'scores_teaching', 'scores_teaching_rank', 'scores_research',
       'scores_research_rank', 'scores_citations', 'scores_citations_rank',
       'scores_industry_income', 'scores_industry_income_rank',
       'scores_international_outlook', 'scores_international_outlook_rank',
       'location', 'stats_number_students', 'stats_student_staff_ratio',
       'stats_pc_intl_students', 'stats_female_male_ratio', 'aliases',
       'subjects_offered', 'closed', 'unaccredited'],
      dtype='object')

Checking Null values in dataset¶

In [10]:
nan_values=df.isna()
nan_columns=nan_values.any()
nan_columns
Out[10]:
rank_order                           False
rank                                 False
name                                 False
scores_overall                        True
scores_overall_rank                  False
scores_teaching                       True
scores_teaching_rank                 False
scores_research                       True
scores_research_rank                 False
scores_citations                      True
scores_citations_rank                False
scores_industry_income                True
scores_industry_income_rank          False
scores_international_outlook          True
scores_international_outlook_rank    False
location                             False
stats_number_students                False
stats_student_staff_ratio            False
stats_pc_intl_students               False
stats_female_male_ratio               True
aliases                              False
subjects_offered                     False
closed                               False
unaccredited                         False
dtype: bool

Data Analysis¶

Often, you will have a huge dataset but you will be interested in only a subset. It is helpful to understand how you can extract interesting/relevant data to reduce the size of the dataset that you need to work with.

For this exercise, let us focus on these columns only:

  • rank_order
  • rank
  • name
  • location
  • subjects_offered

And let us only look at the top-200 universities based on the rank.

To get top-N rows sorted by a column (rank_order), we will use this function: DataFrame.sort_values as described here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

In our case, we need to pass the following args:

  • by = 'rank_order'
  • axis = 0 (because we want to get top-N rows)

We will also use "Slicing" in Python to get the first-N values. For example, to get the first 10 elements in an array/list, you will use: arr[:10]

In [11]:
top_200_universities = df.sort_values(by='rank_order', axis=0)[:200]
In [12]:
top_200_universities.shape
Out[12]:
(200 (0xc8), 24 (0x18))

As you see above, we now have a dataframe which has only 200 rows.

Now let us extract specific columns. To do this, you will need to create a list of columns that you want to keep. Then, you will use that list to filter the original dataframe to get the columns you need.

For example, if you only need col2 and col6 from a dataframe, you will use this:

df = df[['col2', 'col6']]

In [13]:
cols_to_keep = ['rank_order', 'rank', 'name', 'location', 'subjects_offered']
In [14]:
# Create a copy so that the original dataset remains unchanged
df_2022_rankings = top_200_universities[cols_to_keep].copy()
In [15]:
df_2022_rankings
Out[15]:
rank_order rank name location subjects_offered
0 10 1 University of Oxford United Kingdom Accounting & Finance,General Engineering,Commu...
1 20 =2 California Institute of Technology United States Languages, Literature & Linguistics,Economics ...
2 30 =2 Harvard University United States Mathematics & Statistics,Civil Engineering,Lan...
3 40 4 Stanford University United States Physics & Astronomy,Computer Science,Politics ...
4 50 =5 University of Cambridge United Kingdom Business & Management,General Engineering,Art,...
... ... ... ... ... ...
195 1960 196 Medical University of Graz Austria Medicine & Dentistry,Other Health
196 1970 =197 University of Erlangen-Nuremberg Germany Archaeology,Computer Science,Sport Science,Bio...
197 1980 =197 University of Geneva Switzerland Politics & International Studies (incl Develop...
198 1990 =197 University of Hohenheim Germany Business & Management,Economics & Econometrics...
199 2000 =197 Sapienza University of Rome Italy Politics & International Studies (incl Develop...

200 rows × 5 columns

As you can see above, you are left with 200 rows and 5 columns that are of interest.

Data Cleaning¶

Sometimes the data may not be in a clean form for us to use for furthur analysis. For Example rank column above you can see there is an extra "Equal to" sign. Lets remove that.

You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html

In [16]:
df_2022_rankings['rank'] = df_2022_rankings['rank'].str.replace('=', '')
In [17]:
df_2022_rankings
Out[17]:
rank_order rank name location subjects_offered
0 10 1 University of Oxford United Kingdom Accounting & Finance,General Engineering,Commu...
1 20 2 California Institute of Technology United States Languages, Literature & Linguistics,Economics ...
2 30 2 Harvard University United States Mathematics & Statistics,Civil Engineering,Lan...
3 40 4 Stanford University United States Physics & Astronomy,Computer Science,Politics ...
4 50 5 University of Cambridge United Kingdom Business & Management,General Engineering,Art,...
... ... ... ... ... ...
195 1960 196 Medical University of Graz Austria Medicine & Dentistry,Other Health
196 1970 197 University of Erlangen-Nuremberg Germany Archaeology,Computer Science,Sport Science,Bio...
197 1980 197 University of Geneva Switzerland Politics & International Studies (incl Develop...
198 1990 197 University of Hohenheim Germany Business & Management,Economics & Econometrics...
199 2000 197 Sapienza University of Rome Italy Politics & International Studies (incl Develop...

200 rows × 5 columns

Making Data-Driven Decisions¶

Now that you have a subset of the data, you can use that to answer questions that help you make decisions. Let us try to answer some questions like:

  • What are the top-10 Computer Science Universities?
  • What are the top-10 Computer Science Universities in the US?

To check whether "Computer Science" is offered in a university, we will use str.contains function for a column.

This function will return a Boolean Series containing True or False depending on whether the pattern was found or not. See documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

We will also add another column is_computer_science that will specify whether or not Computer Science is offered for that university.

In [18]:
df_2022_rankings['is_computer_science'] = df_2022_rankings['subjects_offered'].str.contains('Computer Science')
In [19]:
df_2022_rankings
Out[19]:
rank_order rank name location subjects_offered is_computer_science
0 10 1 University of Oxford United Kingdom Accounting & Finance,General Engineering,Commu... True
1 20 2 California Institute of Technology United States Languages, Literature & Linguistics,Economics ... True
2 30 2 Harvard University United States Mathematics & Statistics,Civil Engineering,Lan... True
3 40 4 Stanford University United States Physics & Astronomy,Computer Science,Politics ... True
4 50 5 University of Cambridge United Kingdom Business & Management,General Engineering,Art,... True
... ... ... ... ... ... ...
195 1960 196 Medical University of Graz Austria Medicine & Dentistry,Other Health False
196 1970 197 University of Erlangen-Nuremberg Germany Archaeology,Computer Science,Sport Science,Bio... True
197 1980 197 University of Geneva Switzerland Politics & International Studies (incl Develop... True
198 1990 197 University of Hohenheim Germany Business & Management,Economics & Econometrics... False
199 2000 197 Sapienza University of Rome Italy Politics & International Studies (incl Develop... True

200 rows × 6 columns

As you can see above, a new column is_computer_science is created which contains True or False depending on whether the course is offered or not.

1. Top-10 Computer Science Universities¶

We will use the same "Slicing" as before to get the top-10 universities. Before that, we want to also get the universities that DO offer Computer Science. We will use the is_computer_science column from before and get only those rows where the value is True.

In [20]:
cs_universities = df_2022_rankings[df_2022_rankings['is_computer_science']==True]
In [21]:
cs_universities
Out[21]:
rank_order rank name location subjects_offered is_computer_science
0 10 1 University of Oxford United Kingdom Accounting & Finance,General Engineering,Commu... True
1 20 2 California Institute of Technology United States Languages, Literature & Linguistics,Economics ... True
2 30 2 Harvard University United States Mathematics & Statistics,Civil Engineering,Lan... True
3 40 4 Stanford University United States Physics & Astronomy,Computer Science,Politics ... True
4 50 5 University of Cambridge United Kingdom Business & Management,General Engineering,Art,... True
... ... ... ... ... ... ...
193 1940 193 Queensland University of Technology Australia Languages, Literature & Linguistics,Education,... True
194 1950 193 Texas A&M University United States Art, Performing Arts & Design,Sociology,Chemis... True
196 1970 197 University of Erlangen-Nuremberg Germany Archaeology,Computer Science,Sport Science,Bio... True
197 1980 197 University of Geneva Switzerland Politics & International Studies (incl Develop... True
199 2000 197 Sapienza University of Rome Italy Politics & International Studies (incl Develop... True

193 rows × 6 columns

As you can see, the number of rows is now reduced to 193. Now out of these, we want the top 10 ranked universities.

In [22]:
top_cs_universities = cs_universities[:10]
In [23]:
top_cs_universities[['name', 'rank', 'location']]
Out[23]:
name rank location
0 University of Oxford 1 United Kingdom
1 California Institute of Technology 2 United States
2 Harvard University 2 United States
3 Stanford University 4 United States
4 University of Cambridge 5 United Kingdom
5 Massachusetts Institute of Technology 5 United States
6 Princeton University 7 United States
7 University of California, Berkeley 8 United States
8 Yale University 9 United States
9 The University of Chicago 10 United States

2. Top-10 Computer Science Universities in the US¶

We will use the same "Slicing" as before to get the top-10 universities.

We also have the is_computer_science column from before that we will use to get universities offering Computer Science.

We now need to get universities whose location is "United States"

In [24]:
cs_universities_us = cs_universities[cs_universities['location']=='United States']
In [25]:
cs_universities_us.head(5)
Out[25]:
rank_order rank name location subjects_offered is_computer_science
1 20 2 California Institute of Technology United States Languages, Literature & Linguistics,Economics ... True
2 30 2 Harvard University United States Mathematics & Statistics,Civil Engineering,Lan... True
3 40 4 Stanford University United States Physics & Astronomy,Computer Science,Politics ... True
5 60 5 Massachusetts Institute of Technology United States Mathematics & Statistics,Languages, Literature... True
6 70 7 Princeton University United States Languages, Literature & Linguistics,Biological... True

As you can see, now we have a dataframe containing only universities in the US. It is straightforward to extract the top-10 using the slicing method now.

In [26]:
top_cs_universities_us = cs_universities_us[:10]
In [27]:
top_cs_universities_us[['name', 'rank', 'location']]
Out[27]:
name rank location
1 California Institute of Technology 2 United States
2 Harvard University 2 United States
3 Stanford University 4 United States
5 Massachusetts Institute of Technology 5 United States
6 Princeton University 7 United States
7 University of California, Berkeley 8 United States
8 Yale University 9 United States
9 The University of Chicago 10 United States
10 Columbia University 11 United States
12 Johns Hopkins University 13 United States

Let us cross-check the number of rows in the top CS universities. This should be equal to 10.

In [28]:
print(top_cs_universities_us.shape[0])
10

Data Visualization¶

There are multiple ways to represent the data graphically. Please go through this link to see multiple options https://plotly.com/python/plotly-express/

Now that we have Top CS Universities in US dataset avaliable readily, we can plot them in a graph

Graph Examples : Histogram, bar, Line, Area etc.

In [29]:
# Histogram Chart Below

fig = px.histogram(top_cs_universities_us, y='rank', x='name',text_auto=True)
fig.show()
In [30]:
fig = px.pie(top_200_universities, values='stats_student_staff_ratio', names='location', 
             title='Staff to Student Ratio')
fig.show()